
Conversation

penelopeysm (Member) commented Nov 12, 2025

Warning

This PR is not really meant for merging, it's mainly for discussion.

This PR replaces ThreadSafeVarInfo (which is essentially a vector of accumulators) with atomic accumulators, that is, mutable struct accumulators whose fields can be added to atomically.

(This idea was previously discussed in #1078, and quite a few of the ideas there will reappear here.)
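
To make this concrete, here is a minimal sketch of what such an accumulator could look like. The names and helper below are illustrative only, not the actual types in this PR:

# A minimal sketch, not the PR's implementation: a log-probability accumulator whose
# running total is updated atomically from multiple threads, instead of keeping one
# copy of the accumulator per thread as ThreadSafeVarInfo effectively does.
mutable struct AtomicLogProbAcc{T<:Real}
    @atomic logp::T
end

# Each thread adds its contribution with an atomic read-modify-write.
add_logp!(acc::AtomicLogProbAcc, inc::Real) = (@atomic acc.logp += inc; acc)

acc = AtomicLogProbAcc(0.0)
Threads.@threads for i in 1:1000
    add_logp!(acc, -0.5)
end
@atomic acc.logp   # -500.0, regardless of how the iterations interleave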

Why?

Central to #1132 is the idea of removing Metadata. With metadata removed, there is, at least for (Fast)LDF purposes, no longer anything blocking threaded assume tilde-statements.*

Here is a proof of principle. I want to be super clear in saying that the code changes in this PR are not needed to enable threaded assume. ThreadSafeVarInfo already works correctly, and the snippet below can be run on either #1132 or this PR. Benchmarks are provided below for both.

* There is one blocker: to generate a FastLDF, you need a metadata from which to read the ranges and linked statuses, but you can't generate that metadata if the model has threaded assume statements. The snippet here works around this by manually constructing the ranges. Solving this properly would require some surgery on get_ranges_and_linked, but it should be doable (and indeed I believe that we should ultimately decouple get_ranges_and_linked from a metadata).

Performance

(This code is quite poorly written, by the way; it would be much better if expected_logpdf and get_ranges_and_linked dispatched on the model function. And yes, the trivial model has a bogus y=nothing argument just to make those functions work.)

TSVI code is run with #1132, atomic accs code with this PR. Julia 1.11.7. Number of threads as shown below.

using DynamicPPL, Distributions, LogDensityProblems, ForwardDiff, ADTypes, Test, Chairmarks

@model trivial(y=nothing) = x ~ Normal()
@model function threaded_asm_and_obs(y)
    x = Vector{Float64}(undef, length(y))
    Threads.@threads for i in eachindex(y)
        x[i] ~ Normal()
        y[i] ~ Normal(x[i])
    end
end
@model function threaded_obs_only(y)
    x ~ Normal()
    Threads.@threads for i in eachindex(y)
        y[i] ~ Normal(x)
    end
end

function expected_logpdf(x, y)
    p = sum(logpdf.(Normal(), x))
    y === nothing && return p # this is the trivial model
    # handle both models
    lk = if length(x) == length(y)
        sum(logpdf.(Normal.(x), y))
    else
        sum(logpdf.(Normal(x[1]), y))
    end
    return p + lk
end
function expected_grad(x, y)
    return ForwardDiff.gradient(Base.Fix2(expected_logpdf, y), x)
end

### Need to manually construct the ranges, because the default
### behaviour is to generate a VarInfo and extract ranges from that
### -- but generating the VarInfo would fail if the model has
### threaded assume.
struct NotAVarInfo <: DynamicPPL.AbstractVarInfo
    xs::Vector{Float64}
end
function DynamicPPL.Experimental.get_ranges_and_linked(v::NotAVarInfo)
    n_xs = length(v.xs)
    return if n_xs == 50
        all_iden_ranges = NamedTuple()
        all_dict_ranges = Dict{VarName,DynamicPPL.RangeAndLinked}()
        for i in 1:n_xs
            all_dict_ranges[@varname(x[i])] = DynamicPPL.RangeAndLinked(i:i, false)
        end
        all_iden_ranges, all_dict_ranges
    else
        (x=DynamicPPL.RangeAndLinked(1:1, false),), Dict{VarName,DynamicPPL.RangeAndLinked}()
    end
end
Base.getindex(v::NotAVarInfo, ::Colon) = v.xs

s = Threads.nthreads() > 1 ? "s" : ""
println("$(Threads.nthreads()) thread$s")
function test_atomic(model_constructor, xs, ys; adtype=AutoForwardDiff())
    model = model_constructor(ys)
    println("    $(model.f)")
    # Evaluation: correctness
    print("        eval ")
    ldf = DynamicPPL.Experimental.FastLDF(model, getlogjoint_internal, NotAVarInfo(xs))
    lp1 = LogDensityProblems.logdensity(ldf, xs)
    @test lp1 ≈ expected_logpdf(xs, ys)
    # Evaluation: perf
    display(median(@be LogDensityProblems.logdensity(ldf, xs)))
    # Gradient: correctness
    ldf = DynamicPPL.Experimental.FastLDF(model, getlogjoint_internal, NotAVarInfo(xs); adtype=adtype)
    lp2, grad = LogDensityProblems.logdensity_and_gradient(ldf, xs)
    @test lp2 ≈ expected_logpdf(xs, ys)
    @test grad ≈ expected_grad(xs, ys)
    # Gradient: perf
    print("        grad ")
    display(median(@be LogDensityProblems.logdensity_and_gradient(ldf, xs)))
end

test_atomic(trivial, ones(1), nothing)
test_atomic(threaded_asm_and_obs, ones(50), zeros(50))
test_atomic(threaded_obs_only, ones(1), zeros(50))

# Both PRs (no threadsafe stuff)
#
# 1 thread
#     trivial
#         eval 22.459 ns (1 allocs: 16 bytes)
#         grad 45.738 ns (4 allocs: 128 bytes)
#     threaded_asm_and_obs
#         eval 30.125 μs (770 allocs: 20.156 KiB)
#         grad 207.167 μs (4369 allocs: 449.125 KiB)
#     threaded_obs_only
#         eval 23.792 μs (422 allocs: 12.125 KiB)
#         grad 26.708 μs (428 allocs: 14.766 KiB)
#
# This PR (atomic accumulators)
#
# 4 threads
#     trivial
#         eval 148.840 ns (10 allocs: 272 bytes)
#         grad 116.071 ns (7 allocs: 272 bytes)
#     threaded_asm_and_obs
#         eval 30.125 μs (989 allocs: 28.938 KiB)
#         grad 159.916 μs (3985 allocs: 264.266 KiB)
#     threaded_obs_only
#         eval 17.458 μs (445 allocs: 14.062 KiB)
#         grad 24.375 μs (445 allocs: 14.953 KiB)
#
# PR #1132 (ThreadSafeVarInfo)
#
# 4 threads
#     trivial
#         eval 573.864 ns (19 allocs: 976 bytes)
#         grad 210.262 ns (12 allocs: 1.156 KiB)
#     threaded_asm_and_obs
#         eval 28.958 μs (1194 allocs: 35.047 KiB)
#         grad 117.833 μs (4015 allocs: 533.875 KiB)
#     threaded_obs_only
#         eval 13.354 μs (454 allocs: 15.578 KiB)
#         grad 13.802 μs (450 allocs: 19.156 KiB)

TSVI or atomic accs?

Based on the above benchmarks, it looks like TSVI is faster when something threadsafe needs to be done (especially for gradients). But when there's no threadsafe evaluation needed (trivial model), TSVI introduces way more overhead.

It's not shown above, but I also tried making the trivial model much larger, with fifty Normal() variables (in serial). TSVI again introduced far more overhead for evaluation (3.5x more than atomic accs), but oddly enough the TSVI gradient was slightly faster.

Here's a table with some pros and cons:

| | TSVI | Atomic accs | Explanation |
| --- | :---: | :---: | --- |
| Supports threaded assume with FastLDF | ✅ | ✅ | True of both, because no metadata is needed. |
| Generalises to all accumulators | ✅ | ❌ | This PR only 'fixes' the logp accumulators; threadsafe versions of other accumulators must be patched separately. For LDF this is fine, but ultimately we would have to fix VAIM and others. Not impossible, but to its credit, TSVI fixes it all in one fell swoop. |
| Avoids threadid indexing | ❌ | ✅ | threadid indexing is the core problem with Threads.@threads, and we're quite lucky not to have run into correctness issues with TSVI so far. That might come back to bite us one day. |
| Performance | — | — | A bit of a toss-up, see above. But I think that if we provide a way to opt in or out, TSVI wins, because it's faster when actually needed; its large overheads on models that don't need it wouldn't matter if you can opt out. |
| Fixes the convert_eltype issue | — | ❌ | Because the atomic accumulator is a mutable struct, its type can't be promoted automatically midway through model evaluation, so it has to be instantiated with the correct type at the very beginning. You still need the whole float_type_with_fallback(eltype(params)) thing (see the sketch after this table). |
| Code complexity | ❌ | ✅ | (My opinion!) TSVI is basically a big patch on top of the whole thing, whereas accumulators are the 'correct' place to fix this. |
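
To illustrate the convert_eltype point, here's a tiny sketch (the Acc type is hypothetical, not DynamicPPL's actual accumulator): once the mutable struct is constructed with a Float64 field, you can't store e.g. a ForwardDiff.Dual in it, so the eltype has to be decided before evaluation starts.

using ForwardDiff

mutable struct Acc{T<:Real}
    @atomic logp::T
end

acc = Acc(0.0)                # field type is fixed to Float64 at construction
dual = ForwardDiff.Dual(1.0, 1.0)
# @atomic acc.logp += dual    # errors: the resulting Dual can't be converted to Float64
acc = Acc(zero(dual))         # instead, the eltype must be chosen up front, which is
                              # what float_type_with_fallback(eltype(params)) is for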

penelopeysm marked this pull request as draft November 12, 2025 01:37
github-actions (Contributor) commented

Benchmark Report for Commit aceec5d

Computer Information

Julia Version 1.11.7
Commit f2b3dbda30a (2025-09-08 12:10 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)

Benchmark Results

┌───────────────────────┬───────┬─────────────┬───────────────────┬────────┬────────────────┬─────────────────┐
│                 Model │   Dim │  AD Backend │           VarInfo │ Linked │ t(eval)/t(ref) │ t(grad)/t(eval) │
├───────────────────────┼───────┼─────────────┼───────────────────┼────────┼────────────────┼─────────────────┤
│ Simple assume observe │     1 │ forwarddiff │             typed │  false │            6.6 │             1.8 │
│           Smorgasbord │   201 │ forwarddiff │             typed │  false │          739.0 │            43.8 │
│           Smorgasbord │   201 │ forwarddiff │ simple_namedtuple │   true │          438.8 │            56.9 │
│           Smorgasbord │   201 │ forwarddiff │           untyped │   true │          803.5 │            36.0 │
│           Smorgasbord │   201 │ forwarddiff │       simple_dict │   true │         6925.2 │            31.7 │
│           Smorgasbord │   201 │ forwarddiff │      typed_vector │   true │          769.2 │            41.8 │
│           Smorgasbord │   201 │ forwarddiff │    untyped_vector │   true │          792.4 │            37.4 │
│           Smorgasbord │   201 │ reversediff │             typed │   true │          925.7 │            45.9 │
│           Smorgasbord │   201 │    mooncake │             typed │   true │          734.0 │             6.1 │
│           Smorgasbord │   201 │      enzyme │             typed │   true │          916.5 │             4.1 │
│    Loop univariate 1k │  1000 │    mooncake │             typed │   true │         3967.5 │             5.7 │
│       Multivariate 1k │  1000 │    mooncake │             typed │   true │         1034.9 │             8.9 │
│   Loop univariate 10k │ 10000 │    mooncake │             typed │   true │        43600.9 │             5.3 │
│      Multivariate 10k │ 10000 │    mooncake │             typed │   true │         8902.5 │             9.8 │
│               Dynamic │    10 │    mooncake │             typed │   true │          125.4 │            11.1 │
│              Submodel │     1 │    mooncake │             typed │   true │            9.5 │             6.2 │
│                   LDA │    12 │ reversediff │             typed │   true │         1046.9 │             1.9 │
└───────────────────────┴───────┴─────────────┴───────────────────┴────────┴────────────────┴─────────────────┘

mhauru (Member) commented Nov 12, 2025

> But I think that if we provide a way to opt out or in, then TSVI wins, because it's faster when actually needed. Although TSVI has large overheads for models which don't need it, if you can opt out, that wouldn't be an issue.

That's also assuming that people know of and use the opt out. I bet a lot of people wouldn't realise it's there.

I wonder if the overheads for atomic accs might be mostly because of mutable rather than @atomic. Even if so, I'm not sure that helps us avoid them in any way. If I find the time I might do some more benchmarking to see how things scale with model size, but the overheads don't look promising.
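
One way to separate the two costs would be something like this (a sketch with made-up accumulator types, benchmarked with Chairmarks as in the snippets above):

using Chairmarks

mutable struct PlainAcc
    logp::Float64
end
mutable struct AtomicAcc
    @atomic logp::Float64
end

function accumulate_plain!(a::PlainAcc, n)
    for _ in 1:n
        a.logp += 1.0
    end
    return a.logp
end
function accumulate_atomic!(a::AtomicAcc, n)
    for _ in 1:n
        @atomic a.logp += 1.0
    end
    return @atomic a.logp
end

@be accumulate_plain!(PlainAcc(0.0), 1_000)   # overhead of mutable alone
@be accumulate_atomic!(AtomicAcc(0.0), 1_000) # overhead of mutable plus @atomic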

Base automatically changed from py/ldf to breaking November 13, 2025 13:30
